Application of machine learning algorithms to the study of noise artifacts in gravitational-wave data 
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The sensitivity of searches for astrophysical transients in data from the Laser Interferometer Gravitational- 
wave Observatory (LIGO) is generally limited by the presence of transient, non-Gaussian noise artifacts, which 
occur at a high-enough rate such that accidental coincidence across multiple detectors is non-negligible. Further- 
more, non-Gaussian noise artifacts typically dominate over the background contributed from stationary noise. 
These “glitches” can easily be confused for transient gravitational-wave signals, and their robust identification 
and removal will help any search for astrophysical gravitational-waves. We apply Machine Learning Algorithms 
(MLAs) to the problem, using data from auxiliary channels within the LIGO detectors that monitor degrees of 
freedom unaffected by astrophysical signals. Terrestrial noise sources may manifest characteristic disturbances 
in these auxiliary channels, inducing non-trivial correlations with glitches in the gravitational-wave data. The 
number of auxiliary-channel parameters describing these disturbances may also be extremely large; high dimen- 
sionality is an area where MLAs are particularly well-suited. We demonstrate the feasibility and applicability of 
three very different MLAs: Artificial Neural Networks, Support Vector Machines, and Random Forests. These 
classifiers identify and remove a substantial fraction of the glitches present in two very different data sets: four 
weeks of LIGO’s fourth science run and one week of LIGO’s sixth science run. We observe that all three al- 
gorithms agree on which events are glitches to within 10% for the sixth science run data, and support this by 
showing that the different optimization criteria used by each classifier generate the same decision surface, based 
on a likelihood-ratio statistic. Furthermore, we find that all classifiers obtain similar limiting performance, sug- 
gesting that most of the useful information currently contained in the auxiliary channel parameters we extract is 
already being used. Future performance gains are thus likely to involve additional sources of information, rather 
than improvements in the MLAs themselves. 


I. INTRODUCTION 

The Laser Interferometer Gravitational-wave Observa- 
tory (LIGO) is a two-site network of ground-based detec- 
tors designed for the direct detection and measurement of 
gravitational-wave signals from astrophysical sources [1, 2]. 
The LIGO detectors, in their initial configuration [1], have op- 
erated since 2001 and conducted several scientific runs, col- 
lecting data with incrementally increased sensitivity in each 
run [1, 3, 4], Although no gravitational waves were detected, 
these runs tested and refined key technologies, as well as pro- 
vided a large amount of data characterizing the detectors. The 
next generation of detectors, referred to as the advanced LIGO 
detectors, are currently under construction and are expected to 
be operational by 2015 [1, 2]. Major upgrades to lasers, op- 
tics, and seismic isolation/sensing will provide roughly a fac- 
tor of ten improvement to sensitivity, which corresponds to a 
factor of 1000 in the observable volume of space and the num- 
ber of detectable sources. Based on our current knowledge of 
potential astrophysical sources, the advanced LIGO detectors 
are expected to make routine gravitational-wave detections 
(see, for example, [5]) and will open the era of gravitational- 
wave astronomy. 


The LIGO detector noise may be characterized by an ap- 
proximately stationary component of colored Gaussian noise, 
with the addition of short duration non-Gaussian noise ar- 
tifacts called “glitches” (other noise sources such as non- 
stationary lines and broadband non-stationarity do not always 
fit neatly into this framework). The stationary noise in the 
instrument is dominated by low-frequency seismic noise cou- 
pling to mirror motion, thermal noise in the mirrors and sus- 
pensions, 60 Hz power lines and harmonics, and shot noise. 
Sources of transient noise can include temporary seismic, 
acoustic, or magnetic disturbances, power transients, scattered 
light, dust crossing the beam, instabilities in the interferome- 
ter, channel saturations, and other complicated and often non- 
linear effects. To monitor these disturbances and keep the in- 
strument in a stable operating condition through active feed- 
back, each detector records hundreds of auxiliary channels 
along with the gravitational-wave channel. These auxiliary 
channels keep track of important non-gravitational-wave de- 
grees of freedom in the interferometer, as well as information 
about the local environment. They are critical to understand- 
ing the state of the instrument at any particular time. 

One of LIGO’s main scientific goals is the detection of 
transient gravitational-wave signals, which can come from 
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the coalescence of a compact binary or core-collapse super- 
nova, among other astrophysical sources [6]. The presence 
of glitches is problematic for searches targeting these sig- 
nals because glitches can be easily confused with transient 
gravitational -wave signals. The primary method to distin- 
guish a real gravitational-wave transient from instrumental ar- 
tifact is to check that a signal appears in two or more geo- 
graphically distant detectors. While this coincidence require- 
ment is extremely effective, a high rate of glitches means 
that accidental coincidence of noise transients across multi- 
ple detectors still dominates the search background, result- 
ing in weaker upper limits and making the confident detec- 
tion of real signals challenging. The problem is most se- 
vere in searches for transients with poorly-modeled or little 
identifying waveform structure, such as generic gravitational- 
wave bursts or intermediate-mass binary black-hole coales- 
cence which spend only a short amount of time (few cycles) 
in the LIGO sensitive band. 

While the precise noise characteristics in the advanced de- 
tectors will be different from those of initial LIGO, glitch 
sources for future data will exist and the detection problem 
for short duration signals will persist. Thus, it is critical to 
develop data analysis methods for the robust identification of 
glitches in LIGO data. For previous investigations of instru- 
mental artifacts, their impact on gravitational-wave searches 
and the methods for their identification, see [7-14]. We use 
the Ordered Veto List (OVL) algorithm as a benchmark for 
our investigations. OVL has been used in recent LIGO sci- 
ence runs as one of the primary glitch detection algorithms. 
In particular, an earlier version of OVL described in [15] was 
used during LIGO’s fifth science run [16]. OVL attempts to 
measure the degree of likelihood that a gravitational wave can- 
didate can be associated with a transient instrumental distur- 
bance found in one of the many auxiliary channels. 

Glitches are induced by the detector’s environment, noise in 
the detector subsystems, or a combination of thereof. These 
sources should appear in the auxiliary channels as well. Be- 
cause legitimate gravitational waves may couple to channels 
besides the nominal gravitational wave channel, we use a sub- 
set of auxiliary channels shown to be insensitive to gravita- 
tional waves. This subset is generated through hardware in- 
jections at the detectors [13]. The hardware injections involve 
driving the test masses through magnetic coupling with an ex- 
pected gravitational wave signal and searching for evidence 
of that signal in auxiliary channels. If the signal does not 
systematically appear in an auxiliary channel, that channel is 
deemed “safe” and we include it in our analysis. By analyz- 
ing information from these auxiliary channels, one may be 
able to distinguish glitches from genuine gravitational-wave 
signals and ideally establish their cause. The main difficulty 
in such an analysis is processing the information from hun- 
dreds of channels which may manifest non-trivial correlations 
between themselves when they respond to an instrumental 
disturbance. Given the high dimensionality and the absence 
of reliable statistical models for noise and coupling between 
auxiliary channels, traditional computational methods are not 
well-suited to this problem. On the other hand, Machine learn- 
ing algorithms (MLA)s have been used to solve problems like 


this since the 1970’s in other fields such as computer science, 
biology, and finance. 

This paper presents the use of MLAs for the purpose of 
glitch identification in gravitational-wave detectors. The main 
goal of the paper is to establish the feasibility of apply- 
ing MLAs in the context of the LIGO detectors. We con- 
sider three well-known algorithms: the Artificial Neural Net- 
work (ANN), the Support Vector Machines (SVM), and the 
Random Forest (RF). We explore their properties and test 
their performance by analyzing data from past scientific runs. 
Based on these tests, we discuss the prospects for using MLAs 
for glitch identification in the advanced LIGO detectors. 

This paper is organized as follows. In Section II, we 
describe the process for reducing raw time-series data and 
preparing feature vectors for the MLA classifiers. This is fol- 
lowed by a general formulation of the glitch detection prob- 
lem in Section III. Then, in Section IV, we briefly describe the 
classifiers’ algorithms. Training and testing of the classifiers 
is discussed in Section V. Finally, we evaluate and compare 
the classifiers’ performances using the standard Receiver Op- 
erating Characteristic (ROC) curves in Section VI and inves- 
tigate various ways of combining classifiers in Section VII. In 
Appendix A, we explore several optimization criteria used by 
the classifiers and verify their theoretical consistency. 


II. DATA PREPARATION 

We use data taken by the 4km-arm detector at Hanford, WA 
(HI) during LIGO’s fourth science run (S4: 24 February - 
24 March 2005), and data taken by the 4km-arm detector at 
Livingston, LA (LI) during one week (28 May - 4 June 2010) 
of LIGO’s sixth science run (S6: 7 July 2009 - 20 October 
2010). Hereafter we refer to these data sets as the S4 and the 
S6 data. 

In the time between the fourth and the sixth science runs, 
the detectors underwent major commissioning and improve- 
ments to their sensitivity. Thus, while the HI and LI detec- 
tors are identical by design, the data taken by HI during S4 
and the data taken by LI during S6 are quite different. These 
data sets represent evolutionary changes in both the detec- 
tor noise power-spectral-density and the non-Gaussian tran- 
sient artifacts. Differences in the detectors’ environments due 
to their distant geographical locations add another degree of 
freedom. Processing data from detectors separated in time 
and location allows us to determine how adaptable and robust 
these analysis algorithms are. This is important when extrap- 
olating their performance to advanced detectors. 

Classification, or the separation of input data into various 
categories, is one of MLAs main uses; thus, they are often 
referred to as classifiers. We have two categories of data: 
glitches (Class 1) and “clean” data (Class 0). If one was to 
perform a search for gravitational-wave transient signals, the 
first category, glitches, would generally be identified as candi- 
date transient events and considered false alarms. The second 
category, “clean” data, contain only Gaussian detector noise. 
A true gravitational-wave signal, when it arrives at the detec- 
tor, is superposed on the Gaussian detector noise. If the sig- 
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rial's amplitude is high enough, it also would be identified by 
the search algorithm as a candidate transient event. Since it 
is a genuine gravitational-wave transient, it would constitute 
an actual detection, as opposed to glitches which act as noise. 
Hereafter we refer to such candidate gravitational-wave tran- 
sient events, either genuine gravitational-wave transients or 
glitches, as transient events or simply as events. 

We characterize a transient event in either class by infor- 
mation from the detector’s auxiliary channels. Importantly, 
we record the same information for both classes of events. 
Each channel records a time-series measuring some non- 
gravitational-wave degree of freedom, either in the detector 
or its environment. We first reduce the time-series to a set of 
non-Gaussian transients using the Kleine Welle analysis algo- 
rithm [17], which finds clusters of excess signal power in the 
dyadic wavelet domain. The detected transients are ranked 
by their statistical significance, defined as the negative loga- 
rithm of the probability that a random cluster of wavelet co- 
efficients subject to Gaussian noise would contain the same 
or greater signal power. The MLA classifiers use the proper- 
ties of auxiliary channel transients coincident in time with the 
gravitational-wave channel event to classify the gravitational- 
wave event. 

Given an event in the gravitational-wave channel at time t, 
we build a feature vector x out of the nearby auxiliary channel 
transients. Each channel contributes five features: 

• p: The significance of the single loudest transient in that 
auxiliary channel within ±100 ms of t. 

• At: The difference between t and the central time cor- 
responding to the auxiliary channel transient. 

• d: The duration of the auxiliary channel transient. 

• /: The central frequency of the auxiliary channel tran- 
sient. 

• n: The number of wavelet coefficients clustered to 
form the auxiliary channel transient (a measure of time- 
frequency volume). 

We require all auxiliary transients to have a significance of at 
least 15, and if no such auxiliary transient is found within 100 
ms of f, the five fields for that channel are set to zero. The 
significance threshold of 15 and 100 ms window are tunable 
parameters. The 100 ms window covers most transient cou- 
pling timescales identified by previous studies [13]. However, 
there is no guarantee that this window is an optimal choice, or 
that it should be the same for all auxiliary channels. In total, 
we analyze 250 (162) auxiliary channels from S6 (S4) data, re- 
sulting in 1250 (810) dimensions for the auxiliary feature vec- 
tor, x. In addition, we record certain bookkeeping information 
about the original gravitational-wave channel event, the state 
of nearby non-Gaussian transients in the gravitational-wave 
channel, and other information about data quality. These val- 
ues are stripped before classifier training and evaluation so 
that we train the classifiers on only information contained in 
the auxiliary features. 

The set of “glitch” (Class 1) samples, {.x} | , is generated 
by running Kleine Welle over the gravitational-wave channel 


from one of the LIGO detectors. This set of non-Gaussian 
transients from the gravitational-wave channel can, in prin- 
ciple, contain true gravitational-waves. However, prior to the 
coincidence requirement, they are overwhelmingly dominated 
by noise artifacts. Even for the most sensitive data set (S6), 
the expected rate of detectable gravitational-wave transients 
from known astrophysical sources is extremely low (~ 10 9 
Hz [5]) with respect to the rate of single-detector noise tran- 
sients (~H). 1 Hz). For the advanced LIGO detectors, it may 
be appropriate to remove coincident gravitational-wave can- 
didates from the glitch training samples to avoid contamina- 
tion from detectable gravitational-wave events. In both classi- 
fiers’ training and performance evaluation, we treat all Kleine 
Welle transients from the gravitational-wave channel as ar- 
tifacts. In total, we identify 2832 (16,204) noise transients 
above a nominal significance threshold of 35 from the Liv- 
ingston LI (Hanford HI) detector during one week of the S6 
(four weeks of the S4) science run. The central time from each 
event is used to trigger feature vector generation, so that {x}i 
is a set of 2832 (16,204) sample vectors, each described by 
1250 (810) features derived from coincident auxiliary chan- 
nel information. The samples are most representative of the 
background in gravitational-wave burst searches which gener- 
ally target short, unmodeled signals. 

“Clean” (Class 0) samples, {x}o, are formed by first gen- 
erating 10 5 randomly distributed times within the data from 
each detector, producing a Poisson distribution mimicking (at 
a greatly exaggerated rate) that which would come from a set 
of true gravitational-wave signals (ignoring effects like detec- 
tor sensitivity). It can also be seen as a sampling of times 
when there was no glitch in the gravitational-wave channel, 
and will sample the typical auxiliary transients that do not 
produce a gravitational-wave glitch. To further aid in distin- 
guishing times when there is no disturbance, we exclude Class 
0 samples which fall within ±100 ms of a Class 1 sample. As 
with Class 1, the full set of Class 0 samples {x}o is built from 
auxiliary channel information nearby each randomly selected 
time. 


III. GENERAL FORMULATION OF THE DETECTION 
PROBLEM 

The data analysis problem which we address here can be 
formulated as the robust identification of transient artifacts 
(glitches) in the gravitational-wave channel based on the in- 
formation contained solely in the safe auxiliary detector chan- 
nels. Clearly, the solution to this problem is directly related 
to the solution to the ultimate problem of robust detection and 
classification of gravitational-wave transients in LIGO data. 
The identification of glitches will reduce the non-Gaussian 
background and improve the sensitivity of gravitational-wave 
searches. We leave the question of how the results of our cur- 
rent analysis of the auxiliary channels can be incorporated into 
the search for transient gravitational waves to future work. 

For a given transient event in the gravitational-wave chan- 
nel, we construct a feature vector of auxiliary information, x, 
following the procedure outlined in Section II. Our detection 
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problem reduces to binary prediction on whether this transient 
is a glitch (Class 1) or a clean sample (Class 0) based on x and 
only x. In feature-space, x £ V ( \ , this binary decision can be 
mapped into identifying domains for Class 1 events, V\, and 
Class 0 events, Vo . We call the surface separating the two 
regions the decision surface. Unless the two classes are per- 
fectly separable, which is typically not the case, there is a non- 
zero probability for an event of one class to occur in a domain 
identified with a different class. In this case, one would like 
to find an optimal decision surface separating two classes in 
such a way that we maximize the probability of finding events 
of Class 1 in Vi at fixed probability of miscatagorizing events 
from Class 0 in V\. This essentially minimizes the probability 
of incorrectly classifying events. Pi represents the probabil- 
ity of glitch detection, which we also call detection efficiency, 
and Pq is called the false alarm probability. This optimiza- 
tion principle is often referred to as the Neyman-Pearson cri- 
teria [18]. 

The probability of detection and the probability of false 
alarm can be expressed in terms of conditional probability 
density functions for the feature vector, x: 


Pi = [ © (/(*) 

Jv d 

— F*)p(x |l)p(l)dx, 

(la) 

Po = [ e (/Or) 

JV A 

— F*)p(x 0)p(0) dx . 

(lb) 


Herep(x 1 1) andp(x | 0) define probability density functions 
for the feature vector in the presence and absence of a glitch 
in the gravitational- wave data, respectively. p( 1) and p(0) are 
prior probabilities for having a glitch or clean data, related to 
one another via p(l) + p(0) = 1.0 (f(x) — F*) defines the 
region V\ which signifies a glitch in the gravitational-wave 
data, and /(x) = F* defines the decision surface. F* is a 
threshold parameter, which corresponds to a specific value of 
the probability of false alarm through (lb). 

The optimal decision surface is found by maximizing the 
functional 


S[f(x)] = Pi[/(x)] - l 0 (Po[/(x)] - P 0 *) , (2) 


where Pq is a tolerable value for the probability of false alarm 
and l 0 is a Lagrange multiplier. Setting the variation of this 
functional with respect to /(x) to zero leads to the condition 
for the points on the decision surface that 


P(x | l)p(l) 
p(x | 0)p(0) 


= CONSTANT . 


(3) 


The ratio of conditional probability density functions. 


A(x) = 


p{x 1 1 ) 

p(x | 0) ’ 


(4) 


is called the likelihood ratio (sometimes also referred to as 
the Bayes factor). The CONSTANT in the optimality condition 
(3) does not carry any special meaning, and the condition can 
be satisfied if the decision surface is defined as the surface of 
constant likelihood ratio [19], 


with F* set by the probability of false alarm, Pq , through 
(lb). Actually, the decision surface can be defined by any 
monotonic function of the likelihood ratio with trivial redefi- 
nition of F*. There is a unique decision surface for each value 
of Pq £ [0, 1], and we can label decision surfaces by their cor- 
responding values of Pq . 

The optimization of (2) maximizes the detection probabil- 
ity, Pi — > P° PT , for every value of the probability of false 
alarm, Po = Pq. The curve P 1 OPT (Po) is called the Re- 
ceiver Operating Characteristic (ROC) curve. It is a standard 
measure of any detection algorithm’s performance. We can 
think of optimizing (2) as maximizing the area under the ROC 
curve. For further details on use of the likelihood ratio in the 
gravitational-wave searches, see [19, 20]. 

Finding the optimal decision surfaces by direct estimation 
of the conditional probability density functions, p(x 1 1) and 
p(x | 0), is an extremely difficult task if the feature vector con- 
tains more than a few dimensions. For high-dimensional prob- 
lems, when no parametric model for these probability distri- 
butions is known and with a limited number of experimental 
samples that can be used to estimate these probability density 
functions, one has to resort to some other method. MLAs are 
well-suited for these detection problems. 

In this paper, we consider three popular MLAs: ANN, 
SVM and RF. They differ significantly in their underlying 
algorithms and their approaches to classification. This allows 
us to investigate the applicability of different types of MLAs 
to glitch identification in the gravitational-wave data. How- 
ever, all MLAs require training samples of events from both 
Class 1 and Class 0. The MLA classifiers use the training sets 
to find an optimal classification scheme or decision surface. In 
the limit of infinitely many samples and unlimited computa- 
tional resources, different classifiers should recover the same 
theoretical result, the decision surface defined by the constant 
likelihood ratio (5). To this end, it is critical that classifiers 
are trained and optimized using criteria consistent with this 
result. In Appendix A, we explore several standard optimiza- 
tion criteria and derive the decision surfaces they generate in 
this theoretical limit. We find that all of these criteria lead to a 
decision surface with constant likelihood ratio. In particular, 
this is true for the fraction of correctly classified events and 
the Gini index criteria that are used by ANN /SVM and RF, 
respectively. 

While all classifiers we investigate here should find the 
same optimal solution with sufficient data, in practice, the al- 
gorithms are limited by the finite number of samples in the 
training sets and by computational cost. The classifiers have 
to handle a large number of dimensions efficiently, many of 
which might be redundant or irrelevant. By no means is it 
clear that the MLA classifiers will perform well under such 
conditions. It is our goal to demonstrate that they do. 

We evaluate their performance by computing ROC curves. 

1 These curves, which map the classifiers’ overall efficiencies, 
are objective and can be directly compared. In addition to 


1 More traditional veto approaches to data quality in gravitational-wave 
searches use another measure of veto quality. For a given veto configu- 
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comparing the MLA classifiers to each other, we benchmark 
them using ROC curves from the OVL algorithm [21]. This 
method constructs segments of data to be vetoed using a hard 
time window and a threshold on the significance of transients 
in the auxiliary channels. The veto segments are constructed 
separately for different auxiliary channels and are applied 
in the order of decreasing correlation with the gravitational- 
wave events. By construction, only pairwise correlations be- 
tween a single auxiliary channel and the gravitational-wave 
channel are considered by the OVL algorithm. These results 
have a straightforward interpretation and provide a good san- 
ity check. 

In order to make the classifier comparison as fair as possi- 
ble, we train and evaluate their performances using exactly the 
same data. Furthermore, we use a round-robin procedure for 
the training-evaluation cycle, which allows us to use all avail- 
able glitch and clean samples. Samples are randomized and 
separated into ten equal subsets. To classify events in the k th 
subset, we use classifiers trained on all but the k th subset. In 
this way, we ensure that training and evaluation are done with 
disjoint sets so that any over-training that might occur does 
not bias our results. 

An MLA classifier’s output is called a rank, tmla £ [0, 1], 
and a separate rank is assigned to each glitch and clean sam- 
ple. Higher ranks generally denote a higher confidence that 
the event is a glitch. A threshold on this rank maps to the 
probability of false alarm, P 0 , by computing the fraction of 
clean samples with greater or equal rank. Similarly, the prob- 
ability of detection or efficiency, Pi, is estimated by comput- 
ing fraction of glitches with greater or equal rank. Essentially, 
we parametrically define the ROC curve, P° PT (Po), with a 
threshold on the classifier’s rank. Synchronous training and 
evaluation of the classifiers allow us to perform a fair compar- 
ison and to investigate various ways of combining the outputs 
of different classifiers. We discuss our findings in detail in 
Sections VI and VII. 


IV. OVERVIEW OF THE MACHINE LEARNING 
ALGORITHMS 

In this section, we give a short overview of the basic prop- 
erties of the classifiers and the tuning procedures used to de- 
termine the best performing configurations for each classifier. 
Throughout this section, we will use the notation Xi where 
* = 1,2, ...N to denote the set of N sample feature vectors. 
Similarly, y, will denote the actual class associated with the 


ration consisting of a list of disjoint segments of data, the fractional “dead- 
time” is computed from the sum of the durations of all data segments to be 
vetoed. While not precisely the same, this quantity is related to the proba- 
bility of false alarm, Pq , which accounts only for the fraction of clean data 
removed from the search. For a typical rate of glitches of ~0.1 Hz, the two 
measures are almost identical in the most relevant region of Po < 0.01. 
Thus, in that interval the ROC curves of this paper can be directly com- 
pared to the often used figure of merit, efficiency vs. fractional dead-time. 
See for example [13]. 


the i th sample feature vector, either Class 0 or Class 1 . Predic- 
tions about a feature vector’s class will be denoted by y(xi). 


A. Artificial Neural Network 

ANNs employ a machine learning technique based on sim- 
ulating the data processing in human brains and mimicking 
biological neural networks [22, 23]. As is well-known, the 
human brain is composed of a tremendous number of inter- 
connected neurons, with each cell performing a simple task 
(responding to an input stimulus). However, when a large 
number of neurons form a complicated network structure, they 
can perform complex tasks such as speech recognition and 
decision-making. 

A single neuron is composed of dendrites, a cell body, and 
an axon. When dendrites receive an external stimulus from 
other neurons, the cell body computes the signal. When the to- 
tal strength of the stimulus is greater than the synapse thresh- 
old, the neuron is fired and sends an electrochemical signal to 
other neurons through the axon. This process can be imple- 
mented with a simple mathematical model including nodes, 
a network topology and learning rules adopted to a specific 
data processing task. Nodes are characterized by their num- 
ber of inputs and outputs (essentially how many other nodes 
they talk to), and by the connecting weights associated with 
each input and output. The network topology is defined by 
the connections between the neurons (nodes). The learning 
rules prescribe how the connecting weights are initialized and 
evolve. 

There are a large number of ANN models with different 
topologies. For our purpose, we choose to implement the 
multi-layered perceptron (MLP) model, which is one of the 
most widely used models. The MLP model has input and out- 
put layers as well as a few hidden layers in between. The in- 
put vector for the input layer is the auxiliary feature vector, x, 
while the input for hidden layers and the output layer is a com- 
bination of the output from nodes in the previous layer. We 
will call these intermediate vectors z to distinguish them from 
the full feature vector. Each layer has several neurons which 
are connected to the neurons in the adjacent layers with indi- 
vidual connecting weights. The initial structure - the number 
of layers, neurons, and the initial value of connecting weights 
- is chosen by hand and/or through an optimization scheme 
such as a genetic algorithm (GA). 

When a neuron’s input channels receive an external signal 
exceeding the threshold value set by an activation function, the 
neuron is fired. This process can be expressed mathematically 
as: 


y(z) = f(w z + b), (6) 

where y(z) is the output, z is an input vector, w are connect- 
ing weights, / is an activation function, and b is a bias. One 
may choose the activation function to be either the identity 
function, the ramp function, the step function, or a sigmoid 
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function. We use the sigmoid function defined by 

f{wz + b) = (l+e~ 2s{w - z+b) y 1 . (7) 

We set the activation steepness s = 0.5 in hidden layers 
and s = 0.9 at the output layer. There is a single neuron at the 
output layer, and the value of that neuron’s activation function 
is used as ANN’s rank, tann- 

The topological parameters determine the number of con- 
nection weights, which must be sufficiently large that ANN 
has enough degrees of freedom to classify a given datum. The 
network’s flexibility depends on the number of connection 
weights and should be matched with the size of the training 
sets and the input data’s dimensionality. In our work, the num- 
bers of layers and neurons are chosen so that the total number 
of connection weights is on the order of 10 4 when using the 
entire data set. In order to avoid overtraining, we decrease the 
number of layers and neurons in the runs in which either the 
dimensionality or the number of training samples are reduced. 

The learning scheme finds the optimal connection weights, 
w, in each layer. In this paper we use the improved version 
of the Resilient back PROPagation (iRPROP) algorithm [24], 
which minimizes the error between the output value, y{xi), 
and the known value, yi. In this algorithm, the increase (de- 
crease) factor and the minimum (maximum) step-size deter- 
mine the change in the connection weights. Aw, at each iter- 
ation during the training. In our work, the default values in 
the FANN! (FANN!) library [25] are used for all parameters 
except the increase factor, which is set to y + = 1.001. The 
same learning rules were used in all runs. We should note that 
ANNs can be optimized in an alternative way, via a GA or 
other similar methods. When using a GA, a combined opti- 
mization algorithm for topology, feature and weight selection 
can be applied to improve the performance of ANN [26-30]. 
We explore these options in a separate publication [31]. 

In addition to the choosing the ANN configuration param- 
eters, we found that ANN requires data pre-processing. The 
input variables with high absolute values have a greater ef- 
fect on the output values, thus we normalize all components 
of the feature vector, x, to the range [0, 1]. To better resolve 
small At values, At is transformed via a logarithmic function 
before normalization. 


At' = — sign(Af) log | At | . (8) 

This transformation improves ANN’s ability to identify 
glitches, which tend to have smaller values of At. You can 
find a more detailed description of the procedure for tuning 
the ANN configuration parameters in [31]. 


B. Support Vector Machines 

The SVM is an MLA for binary classification on a vector 
space [32, 33]. It finds the optimal hyperplane that separates 
the two classes of training samples. This hyperplane is then 


used as the decision surface in feature space, and classifies 
events depending on which side of the hyperplane they fall. 

As before, let {( Xi , yi) | i = 1, 2, ...TV} be the training data 
set, where Xi is the feature vector of auxiliary transient infor- 
mation near time ti, and y, € {1, —1} is a label that marks the 
sample as either Class 1 or Class 0, respectively. Then assume 
that the training set is separable by a hyperplane w-x — b = 0 , 
where w is the normal vector to the hyperplane and b is the 
bias. Then the training samples with yi = 1 satisfy the condi- 
tion w-Xi — b > 1, and the training samples with yi = — 1 sat- 
isfy the condition w ■ Xi — b < —1. SVM uses a quadratic pro- 
gramming method to find the w and b that maximize the mar- 
gin between the hyperplanes w-x — b= 1 and w ■ x — b = — 1. 

If the training samples are not separable in the original fea- 
ture space, Vd, SVM uses a mapping, (p(x), into a higher di- 
mensional vector space, V^, in which two classes of events 
can be separated. The decision hyperplane in V 0 corresponds 
to a non-linear surface in the original space, Vd- Thus, map- 
ping the problem into a higher dimensional space allows SVM 
to consider non-linear decision surfaces. The dimensionality 
of Vfj, grows exponentially with degree of the non-linearity of 
the decision surfaces in Vd. As a result, SVM can not con- 
sider arbitrary decision surfaces and usually has to deal with 
non-separable populations. If the training samples are not sep- 
arable after mapping, a penalty parameter, C, is introduced to 
weight the training error. Finding the optimal hyperplane is 
reduced to the following quadratic programming problem: 


min ( ■ w + C V & ) , (9a) 

wM V tW 

subject to yi ■ ( w ■ cp(xi) + b) > 1 — , (9b) 

& > 0 , i = 1, 2, ..., TV . (9c) 

When the solution is found, SVM classifies a sample x by the 
decision function: 

y(xi) = sign (w ■ cj>(xi) + b) . (10) 

In solving the quadratic programming problem, the func- 
tion tp is not explicitly needed. It is sufficient to specify 
4>{xi)-(j){xj). The function K{xi, Xj) = <f>(xi)-(f>(xj) is called 
the kernel function. The form of the kernel function implicitly 
defines the family of surfaces in Vd over which SVM is opti- 
mizing. In this study we use the Radial Basis function (RBF) 
as the kernel function. It is defined as 

K(xi,Xj) = exp {- 7 ||xi - Xj\\ 2 } , ( 11 ) 

where 7 is a tunable parameter. 

The SVM algorithm was implemented by using the open- 
source package libsvm [34], As part of this package, the 
kernel function parameter, 7, and the penalty parameter, C, 
are tuned in order to achieve the best performance for a spe- 
cific application. The best parameters ( C and 7) are selected 
through an exhaustive search. For each pair of parameters 
(logC, log 7) on a grid, we calculate a figure of merit. The 
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parameters with the best figure of merit are then used for clas- 
sifying events. The default figure of merit in libsvm is the 
accuracy (fraction of correctly classified events). However, 
we replace it with a figure of merit better adapted to glitch de- 
tection. Instead of using the accuracy, our code calculates the 
area under the estimated ROC curve (Pf ST (Po)) in the interval 
of the probability of false alarm P 0 £ [0.001, 0.05] on a log- 
linear scale ([0.001, 0.05] is a range of acceptable probability 
of false alarm for practical glitch detection). 

Po— 0.05 

figure of merit — J d (lnP 0 ) -Pi ST (-Fo) ( 12 ) 
p 0 =o.ooi 

Performing an exhaustive search for the best SVM parame- 
ters is computationally expensive. We can speed up this tuning 
process by exploiting the fact that the tuning time grows non- 
linearly with the training sample size. By using smaller train- 
ing sets, we can reduce the total time spent determining the 
optimal parameters. We randomly selected p subsets of vec- 
tors from the training set, with each subset 10 times smaller 
than the full training set. The best pair of the SVM parame- 
ters for each of the p subsets was then calculated, with each 
subset optimization running on a single CPU core. This gives 
p sets of best parameters, calculated in parallel. The parame- 
ters C and 7 that are selected the most often were then cho- 
sen as the final best SVM parameters. This modified parame- 
ter optimization algorithm was applied to various training sets 
(described in Section V). We found that the optimal SVM pa- 
rameters do not depend on the training set. We, therefore, use 
the same parameters for all calculations reported in this paper 
(C = 8 and 7 = 0.0078125). 

In its standard configuration, SVM classifies samples by a 
discrete label, y £ {1,-1}. However, the libsvm package can 
provide a probability based version of ( 10 ) that yields con- 
tinuous values, tsvm £ [0, 1] [35], We use these continuous 
values as the output of the SVM classifier. 

C. Random Forest Technology 

RF technology [36, 37] improves upon the classical deci- 
sion tree [38, 39] approach to classification. The classifying 
decision tree performs a series of binary splits on any / all of 
the dimensions of the feature vector, x, that describes an event. 
The goal is to distribute events into groups consisting of only 
a single class. In a machine-learning context, the decision tree 
is formed by training it on a set of events of known class. Dur- 
ing the training, a series of splits are made, where each split 
chooses the dimension and its threshold that optimizes a cer- 
tain criterion, such as the fraction of correctly classified train- 
ing events or the Gini index. Splitting stops once no split can 
further improve the optimization criterion or once the limit on 
the minimum number of events allowed on a branch (the fur- 
thest reaches of a decision tree) is reached; at this point the 
branch becomes a leaf. Once a tree is formed, an event of 
unknown class is fed into the tree, and depending on its fea- 
ture vector, x, it will be labeled as either Class 0 or Class 1. 


However, a single decision tree can be a victim to both false 
minima and over-training. To guard against this, we create a 
forest of decision trees and average over their answers; this 
results in a continuous ranking, trf £ [ 0 , 1 ], rather than a bi- 
nary classification, as events can be placed on a continuum 
between Class 0 and Class 1. 

Each decision tree in the forest is trained on a bootstrap 
replica of the original training set. If the original training set 
has N events, each bootstrap replica will also have N events, 
which are chosen randomly with replacement, meaning any 
given event may be picked more than once. Therefore, each 
tree gets a different set of training events. To further avoid 
false minima, random forest technology chooses a different 
random subset of the features to be available for splitting at 
each node. This ensures that a peculiarity in a particular di- 
mension does not dominate the decision making process. 

We use the StatPatternRecognition software package’s [40] 
implementation of RF. The key input parameters are the num- 
ber of trees in the forest, the number of features randomly 
selected for splitting at each tree node (branching point), 
the minimum number of samples on the terminal tree nodes 
(leaves) and the optimization criterion. To determine the best 
set of the RF parameters, we perform a search over a coarse 
grid in the parameter space, maximizing efficiency or the 
detection probability. Pi, at the probability of false alarm, 
Po = 0.01. We find that beyond a certain point, the RF ef- 
ficiency grows very slowly with the number of trees and the 
number of features selected for splitting at the cost of a sig- 
nificant increase in running time during the training process. 
Taking this into account, we arrive at the following configu- 
ration, which we use in all runs: 100 trees in the forest, 64 
different randomly chosen features at the branching points, a 
minimum of 8 samples on a leaf, and the Gini index as the 
optimization criterion. 


D. Ordered Veto List Algorithm 

The OVL algorithm operates by looking for coincidences 
between the transients in gravitational-wave and auxiliary 
channels. Specifically, the transients identified in the auxiliary 
channel are used to construct a list of time segments. All tran- 
sients in the gravitational-wave channel occurring within these 
segments are removed from the list of transient gravitational- 
wave candidates. In effect, the data in these time segments are 
vetoed prior to any search for gravitational- waves. 

The algorithm assumes transients in certain auxiliary chan- 
nels are more correlated with the glitches in gravitational- 
wave channel and looks for a hierarchy correlations between 
auxiliary and gravitational-wave glitches. Specifically, a se- 
ries of veto configurations is created, corresponding to dif- 
ferent auxiliary channels, the time windows around transients 
and the threshold on their significance. The ordered list cor- 
responds to a list of these configurations, and veto configura- 
tions are applied to the data in order of decreasing correlation. 
For this study, the maximum time window is set to ± 100 ms 
to match the one we use to create auxiliary feature vectors for 
the MLA classifiers (Section II). Similarly, the lowest thresh- 
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old on significance, p, is set to the auxiliary channel nominal 
threshold of 15. For each channel, the number of possible veto 
configurations is equal to the number of unique combinations 
that can be constructed from the list of the time windows, [± 
25 ms, ± 50 ms, ± 100 ms], and the significance thresholds, 
[15, 25, 30, 50, 100, 200, 400, 800, 1600], 

Importantly, a segment removed by a veto configuration is 
not seen by later configurations. This prohibits duplicate ve- 
toes and results in a measurement of the additional informa- 
tion contained in subsequent veto configurations. The per- 
formance of each configuration is evaluated and they are re- 
ranked accordingly. The OVL algorithm defines the veto- 
configuration rank, rovL, as the ratio of the fraction of 
gravitational-wave glitches removed to the fraction of analy- 
sis time removed. Repeated application of the algorithm pro- 
duces an ordered list with the better performing configurations 
appearing higher on the list. 

Only some of the veto configurations make it to the final 
list. Those which perform poorly (rovL < 3) are discarded. 
This is done in order to get rid of irrelevant or redundant chan- 
nels and to speed up the algorithm’s convergence. Typically, 
the OVL algorithm converges within less than 10 iterations to 
a final ordered list. We find that only 47 out of 162 auxiliary 
channels in S4 data and 35 out of 250 auxiliary channels in 
S6 data appear on the final list. Below, we refer to this sub- 
set of channels as the “OVL auxiliary channels.” For a more 
detailed description of the OVL algorithm, see [21]. 

The procedure for optimizing the ordered list of veto con- 
figurations can be considered a training phase. An ordered list 
of veto configurations optimized for a given segment of data 
can be applied to another segment of data. Veto segments 
are generated based on the transients in the auxiliary channels 
and the list of configurations. Performance of the algorithm is 
evaluated by counting fractions of removed glitches and clean 
samples, and computing the ROC curve. As with MLA clas- 
sifiers, we use the round-robin procedure for OVL’s training- 
evaluation cycle. 


V. TESTING THE ALGORITHMS’ ROBUSTNESS 

One of the main goals of this study is to establish if MLA 
methods can successfully identify transient instrumental and 
environmental artifacts in LIGO gravitational-wave data. The 
potential difficulty arises from high dimensionality and the 
fact that information from a large number of dimensions might 
be either redundant or irrelevant. Furthermore, the origin of 
a large fraction of glitches is unknown in the sense that their 
cause has not been pinpointed to a single instrumental or envi- 
ronmental source. In the absence of such deterministic knowl- 
edge, one has to monitor a large number of auxiliary chan- 
nels and look for statistically significant correlations between 
transients in these channels and transients in the gravitational- 
wave channel. These correlations, in principle, may involve 
more than one auxiliary channel and may depend on the tran- 
sients’ parameters in an extremely complicated way. Addi- 
tionally, new kinds of artifacts may arise if one of the detec- 
tor subsystems begins to malfunction. Likewise, some aux- 


iliary channels’ coupling strengths to the gravitational-wave 
channel may be functions of the detector’s state (e.g. opti- 
cal cavity configuration and mirror alignment). Depending 
on the detector’s state, the same disturbance witnessed by 
an auxiliary channel may or may not cause a glitch in the 
gravitational-wave channel. This information can not be cap- 
tured by the Kleine Welle-derived parameters of the transients 
in the auxiliary channels alone and requires extending the cur- 
rent method. We leave this problem to future work. 

Because of the uncertainty in the types and locations of 
correlations, we include as many auxiliary channels and their 
transients’ parameters as possible. However, this forces us 
to handle a large number of features, many of which might 
be either redundant or irrelevant. The MLA classifiers may 
be confused by the presence of these superfluous features and 
their performance may suffer. One can improve performance 
by reducing the number of features and keeping only those 
that are statistically significant. However, this requires pre- 
processing the input data and tuning, which may be extremely 
labor intensive. On the other hand, if the MLA classifier can 
ignore irrelevant dimensions automatically without a signifi- 
cant decrease in performance, it can be used as a robust anal- 
ysis tool for real-time glitch identification and detector char- 
acterization. By efficiently processing information from all 
auxiliary channels, a classifier will be able to identify new ar- 
tifacts and help to diagnose problems with the detector. 

In order to determine our classifiers’ robustness, we per- 
form a series of runs in which we vary the dimensionality of 
the input data and evaluate the classifiers’ performance. First, 
we investigate how their efficiency depends on which tran- 
sient parameters are used. We expect that not all of the five 
parameters ( p , At, /, d, n) are equally informative. Naively, 
p and Af, reflecting the disturbance’s amplitude in the auxil- 
iary channel and its degree of coincidence with the transient in 
gravitational-wave channel, respectively, should be the most 
informative. Potentially, the frequency, /, duration, d, and 
the number of wavelet coefficients, n, may carry useful in- 
formation if only certain auxiliary transients produce glitches. 
However, it is possible that these parameters are only corre- 
lated with the corresponding parameters of gravitational-wave 
transient, which we do not incorporate in this analysis. Such 
correlations, even if not broadened by frequency dependent 
transfer functions, would require analysis specialized to spe- 
cific gravitational-wave signals and goes beyond the scope of 
this paper. We perform a generic analysis, not relying on the 
specific characteristics of the gravitational-wave transients. 

Anticipating that some of the parameters could be irrele- 
vant, we prepare several data sets by removing features from 
the list: (p, At, f, d, n). We prepare these data sets for 
both S4 and S6 data and run each of the classifiers through 
the training-evaluation round-robin cycles described in Sec- 
tion III. We evaluate their performance by computing the 
ROC curves, shown in Figure 1 . 

We note the following relative trends in the ROC curves 
for all classifiers. The omission of the transient’s duration, 
d, and the number of wavelets, n, has virtually no effect 
on efficiency. The ROC curves are the same to within our 
error, which is less than ± 1 % for our efficiency mea- 
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FIG. 1. Varying sample features. We expect some of the five features recorded for each auxiliary channel to be more useful than others. To 
quantitatively demonstrate this, we train and evaluate our classifiers using subsets of our sample data, with each subset restricting the number 
of auxiliary features. We observe the general trend that the significance, p, and time difference. At, are the two most important features. 
Between those two, p appears to be marginally more important than At. On the other hand, the central frequency, /, the duration, d, and the 
number of wavelet coefficients in an event, n, all appear to have very little affect on the classifiers’ performance. Importantly, our classifiers 
are not impaired by the presence of these superfluous features and appear to robustly reject irrelevant data without significant efficiency loss. 
The black dashed line represents a classifier based on random choice. 


surement, based on the total number of glitch samples and 
the normal approximation for binomial confidence interval, 
\J I\ ( 1 - 1\ )/N . Omission of the frequency, /, slightly re- 
duces the efficiency of SVM (Figure lb and Figure le), but 
has no effect on either ANN or RF. A comparison between 
the ROC curves for ( p , At), (p) and (At) data sets shows 
that while a transient’s significance is the most informative pa- 
rameter, including the time difference generally results in bet- 
ter overall performance. Of the three MLA classifiers, SVM 
seems to be the most sensitive to whether the time difference 
is used in addition to significance. RF, as it appears, relies pri- 
marily on significance, which is reflected in poor performance 
of the (Af)-only ROC curves in Figure lc and Figure If. The 
trend for ANN is not as clear. In S4 data, including timing 
does not change the ROC curve (Figure la) while in S6 data 
it improves it (Figure Id). Overall, we conclude that based 
on these tests, most if not ah the information about detected 
glitches is contained in the ( p , At) pair. At the same time, 
keeping irrelevant features does not seem to have a negative 
effect on our classifiers’ performance. 

The OVL algorithm, which we use as a benchmark, ranks 
and orders the auxiliary channels based on the strength of cor- 


relations between transient disturbances in the auxiliary chan- 
nels and glitches in gravitational-wave channel. The final list 
of OVL channels includes only a small subset of the available 
auxiliary channels, 47 (of 162) in S4 data and 35 (of 250) in 
S6 data. The rest of the channels do not show statistically 
significant correlations. It is possible that these channels con- 
tain no useful information for glitch identification, or that one 
has to include correlations involving multiple channels and/or 
other features to exract the useful information. In the for- 
mer case, throwing out irrelevant channels will significantly 
decrease our problem’s dimensionality and may improve the 
classifiers’ efficiency. In the latter case, classifiers might be 
capable of using higher order correlations to identify classes 
of glitches missed by OVL. 

We prepare two sets of data to investigate these possibili- 
ties. In the first data set, we use only the OVL auxiliary chan- 
nels and exclude information from all other channels. In the 
second data set, we further reduce the number of dimensions 
by using only p and At. We apply classifiers to both data sets, 
evaluate their performance and compare it to the run over the 
full data set (all channels and all features). Figure 2 shows the 
ROC curves computed for these test runs. 
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FIG. 2. Reducing the number of channels. One way to reduce the dimensionality of our feature space is to reduce the number of auxiliary 
channels used to create the feature vector. We use a subset of auxiliary channels identified by OVL as strongly correlated with glitches in the 
gravitational-wave channel (light blue). We notice that for the most part, there is not much efficiency loss when restricting the feature space 
in this way. This also means that very little information is extracted from the other auxiliary channels. The classifiers can reject extraneous 
channels and features without significant loss or gain of efficiency. We also restrict the feature vector to only include the significance, p, and the 
time difference, At, for the OVL auxiliary channels (green). Again, there is not much efficiency loss, suggesting that these are the important 
features and that the classifiers can robustly reject unimportant features automatically. The black dashed line represents a classifier based on 
random choice. 


In both S4 and S6 data, the three curves for RF (Figure 2c 
and Figure 2f) lay on the top of each other, demonstrating that 
this classifier’s performance is not affected by the data reduc- 
tion. ANN shows slight improvement in its performance for 
the maximally reduced data set in the S6 data (Figure 2d), and 
no discernible change in the S4 data (Figure 2a). SVM ex- 
hibits the most variation of the three classifiers. While drop- 
ping the auxiliary channels not included in the OVL list has a 
very small effect on SVM’s ROC curve, further data reduction 
leads to an efficiency loss (Figure 2b and Figure 2e). Viewed 
together, the plots in Figure 2 imply that, on one hand, non- 
OVL channels can be safely dropped from the analysis, but 
on the other hand, the presence of these uninformative chan- 
nels does not reduce our classifiers’ efficiency. This is reas- 
suring. As previously mentioned, one would like to use these 
methods for real-time classification and detector diagnosis, in 
which case monitoring as many channels as possible allows us 
to identify new kinds of glitches and potential detector mal- 
functions. For example, an auxiliary channel that previously 
showed no sign of a problem may begin to witness glitches. If 
excluded from the analysis based on its previous irrelevance. 


the classifiers would not be able to identify glitches witnessed 
by this channel or warn of a problem. 

Another way in which input data may influence a classi- 
fier’s performance is by limiting the number of samples in 
the training set. Theoretically, the larger the training sets, 
the more accurate a classifier’s prediction. However, larger 
training sets come with a much higher computational cost and 
longer training times. In our case, the size of the glitch train- 
ing set is limited by the glitch rate in the gravitational-wave 
channel and the duration of the detector’s run. We remind the 
reader that we use four weeks from the S4 am from the HI 
detector and one week from the S6 run from the LI detec- 
tor to collect glitch samples. One would like to use shorter 
segments to better capture non-stationarity of the detector’s 
behavior. However, having too few glitch samples would not 
provide a classifier with enough information. Ultimately, the 
size of the glitch training set will have to be tuned based on 
the detector’s behavior. We have much more control over the 
size of the clean training set, which is based on completely 
random times when the detector was operating in the science 
mode. In our simulations, we start with 10 5 clean samples. 
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but it might be possible to reduce this number without loss of 
efficiency, thereby speeding up classifier training. 

We test how the classifiers’ performance is affected by the 
size of the clean training set in a series of runs in which we 
gradually reduce the number of clean samples available. Runs 
with 100%, 75%, 50%, and 25% of the total number of clean 
samples available for training are supplemented by a run in 
which the number of clean training samples is equal to the 
number of glitch training samples (16% in S4 data and 2.5% 
in S6 data). In addition, we performed one run in which we re- 
duced the number of glitch training samples by half, but kept 
100% of the clean training samples. While not completely 
exhaustive, we believe these runs provide us with enough in- 
formation to describe the classifiers’ behavior. In all of these 
runs, we use all available samples for evaluation, employing 
the round-robin procedure. Figure 3 demonstrates changes in 
the ROC curves due to the variation of training sets. 

RF performance (Figure 3c and Figure 3f) is not affected by 
reduction of the clean training set in the explored range, with 
the only exception of the run over S6 data where size of the 
clean training set is to 2.5% of the original. In this case, the 
ROC curve shows an efficiency loss on the order of 5% at a 
false alarm probability of Po = 10 -3 . Also, cutting the glitch 
training set by half does not affect RF efficiency in either S4 
or S6 data. 

SVM’s performance follows very similar trends, shown in 
Figure 3b and Figure 3e, demonstrating robust performance 
against the reduction of the clean training set and suffering 
appreciable loss of efficiency only in the case of the smallest 
set of clean training samples. Unlike RF, SVM seems to be 
more sensitive to variations in size of glitch training set. The 
ROC curve for the 50% glitch set in S6 data drops 5%-10% in 
the false alarm probability region of / q = 10 -3 (Figure 3e). 
However, this does not happen in the S4 run (Figure 3e). This 
can be explained by the fact that S4 glitch data set has five 
times more samples than S6 set. Even after cutting it in half, 
the S4 set provides better sampling than the full S6 set. 

ANN is affected most severely by training set reduction 
(Figure 3a and Figure 3d). First, its overall performance vis- 
ibly degrades with the size of the clean training set, espe- 
cially in the S6 runs (Figure 3d). Howeer, we note that the 
ROC curve primarily drops near a false alarm probability of 
Po = 10 -3 , it remains the same near Pq = 10 -2 (for all 
but the 2.5% set). The higher Pq value is more important in 
practice because the probability of false alarm of 10 -2 is still 
tolerable and, at the same time, the efficiency is significantly 
higher than at Po = ICC 3 . This means that we are likely to 
operate a real-time monitor near Po = 10 -2 rather than near 
10~ 3 . Reducing the training sample introduces an artifact on 
ANN’s ROC curves, not seen on either RF or SVM. Here, 
the false alarm probability’s range decreases with the size of 
the clean training set. This is due to the fact that with the 
ANN configuration parameters used in this analysis, ANN’s 
rank becomes more degenerate when less clean samples are 
available for training, meaning that multiple clean samples in 
the evaluation set are assigned exactly the same rank. This is 
in general undesirable, because a continuous, non-degenerate 
rank carries more information and can be more efficiently in- 


corporated into gravitational-wave searches. The degeneracy 
issue of ANN and its possible solutions are treated in detail in 

[31]. 

We would like to highlight the fact that in our test runs, we 
use data from two different detectors and during different sci- 
ence runs, and that we test three very different classifiers. The 
common trends we observe are not the result of peculiarities 
in a specific data set or an algorithm. It is reasonable to expect 
that they reflect generic properties of the detectors’ auxiliary 
data as well as the MLA classifiers. Extrapolating this to fu- 
ture applications in advanced detectors, we find it reassuring 
that the classifiers, when suitably configured, are able to mon- 
itor large numbers of auxiliary channels while ignoring irrel- 
evant channels and features. Furthermore, their performance 
is robust against variations in the training set size. In the next 
sections we compare different classifiers in their bulk perfor- 
mance as well as in sample-by-sample predictions using the 
full data sets. 


VI. EVALUATING AND COMPARING CLASSIFIERS’ 
PERFORMANCE 

The most relevant measure of any glitch detection algo- 
rithm’s performance is its detection efficiency, the fraction of 
identified glitches, Pi, at some probability of false alarm, Po. 
The ROC curve is the key figure of merit and can be used 
to assess the algorithm’s efficiency, and objectively compare 
it to other methods throughout the entire range of the prob- 
ability of false alarm. This is useful because the upper limit 
for acceptable values of probability of false alarm depends on 
the specific application. In the problem of glitch detection in 
gravitational-wave data, we set this value to be P 0 = 10 -2 , 
which corresponds to 1% of true gravitational -wave transients 
falsely labeled as glitches. Another way to interpret this is 
that 1% of the clean science data are removed from searches 
for gravitational waves. 

Our test runs, described in the previous section, demon- 
strate the robustness of the MLA classifiers against the pres- 
ence of irrelevant features in the input data. We are interested 
in measuring the classifiers’ efficiency in the common case 
where no prior information about relevance of the auxiliary 
channels is assumed. For this purpose, we use the full S4 and 
S6 data sets, including all channels with a wide selection of 
parameters. Using exactly the same training/evaluation sets 
for all our classifiers allows us to assign four ranks, (tann, 
rsvM, 7 "rp, r ( ;)v L ), to every sample and compute the probabil- 
ity of false alarm, Po(rj) and efficiency, Pi(ry) for each of 
the classifiers. While the ranks can not be compared directly, 
these probabilities can. Any differences in classifiers’ predic- 
tions, in this case, are from the details and limitations of the 
methods themselves, and are not from the training data. 

Glitch samples that are separated in time by less than a sec- 
ond are likely to be caused by the same auxiliary disturbance. 
Even if they are not, gravitational-wave transient candidates 
detected in a search are typically “clustered” with a time win- 
dow ranging from a few hundred milliseconds to a few sec- 
onds, depending on the length of the targeted gravitational- 
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(c) S4 RF 
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(d) S6 ANN 


(e) S6 SVM 


(f) S6RF 


FIG. 3. Varying the size of training data sets. In our sample data, the number of glitches is limited by the actual glitch rate in the LIGO detectors 
and the length of the analysis time we use. However, we can construct as many clean samples as necessary because we sample the auxiliary 
channels at random times. In general, classifiers’ performance will increase with larger training data sets, but at additional computational cost. 
We investigate the effect of varying the size of training sets on the classifiers’ performance, and observe only small changes even when we 
significantly reduce the number of clean samples. We also reduce the number of glitch samples, observing that the classifiers are more sensitive 
to the number of glitches provided for training. This is likely due to the smaller number of total glitch samples, and reducing the number of 
glitches may induce a severe undersampling of feature space. The black dashed line represents a classifier based on random choice. 


wave signal. Clustering implies that among all candidates 
within the time window, only the one with highest statistical 
significance will be retained. In order to avoid double count- 
ing of possibly correlated glitches and to replicate conditions 
similar to a real-life gravitational-wave search, we apply a 
clustering procedure to the glitch samples, using a one second 
time window. In this time window, we keep the sample with 
the highest significance, p, of the transient in gravitational- 
wave channel. 

The ROC curves are computed after clustering. Figure 4 
shows the ROC curves for ANN, SVM, RF and OVL for both 
S4 and S6 data. 

All our classifiers show comparable efficiencies in the most 
relevant range of the probability of false alarm for practical 
applications (10 -3 - 10~ 2 ). Of the three MLA classifiers, RF 
achieves the best efficiency in this range, with ANN and SVM 
getting very close near P 0 = 10” 2 . Relative to other classi- 
fiers, SVM performs worse in the case of S4 data, and ANN’s 
efficiency drops fast at P < HP 3 . The most striking fea- 
ture on these plots is how closely the RF and the OVL curves 
follow each other in both S4 and S6 data (Figure 4a and Fig- 
ure 4b respectively). In absolute terms, the classifiers achieve 


significantly higher efficiency for S6 than for S4 data, 56% 
versus 30% at Pq = 10' 2 . We also note that the clustering 
procedure has more effect on the ROC curves in S4 than in 
S6 data. In the former case, the efficiency drops by 5 - 10% 
(compare to the curves in Figures 3a to 3c), whereas in the 
latter it stays practically unchanged (compare to Figures 3d 
to 3f). The reason for this is not clear. In the context of detec- 
tor evolution, the S6 data are much more relevant for advanced 
detectors. At the same time, we should caution that we use 
just one week of data from the S6 science run and larger scale 
testing is required for evaluating the effect of the detector’s 
non-stationarity. 

The ROC curves characterize the bulk performance of the 
classifiers, but they do not provide information about what 
kind of glitches are identified. To gain further insight into 
the distribution of glitches before and after classification, 
we plot cumulative histograms of the significance, p, in the 
gravitational-wave channel for glitches that remain after re- 
moving those detected by each of the classifiers at Pq < 10 -2 . 
We also plot a histogram of all glitches before any glitch re- 
moval. These histograms are shown in Figure 5. They show 
the effect of each classifier on the distribution of glitches in the 
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(a) S4 ROC curves 



(b) S6 ROC curves 


FIG. 4. Comparing algorithmic performance. We directly compare 
the performance of RF (green), ANN (blue), SVM (red), and OVL 
(light blue) using the full data sets. Glitches are clustered in time. We 
see that all the classifiers perform similarly, particularly in S6. There 
is a general trend of higher performance in S6 than in S4, which we 
attribute to differences in the types of glitches present in the two data 
sets. We should also note that all the MLA classifiers achieve perfor- 
mance similar to our benchmark, OVL, but RF appears to perform 
marginally better for a large range of the False Alarm Probability. 
The dashed line corresponds to a classifier based on random choice. 


gravitational-wave channel. In both the S4 and S6 data sets, 
the tail of the glitch distribution, which contains samples with 
the highest significance, is reduced. At the same time, as is 
clear from the plots, many glitches in the mid range of signif- 
icances are also removed, contributing to overall lowering of 
the background for transient gravitational-wave searches. The 
fact that our classifiers remove low significance glitches while 
some of the very high significance glitches are left behind in- 


dicates that there is no strong correlation between amplitude 
of glitches in gravitational-wave channel and their detectabil- 
ity using auxiliary channel information. This in turn implies 
that we either do not provide all necessary information for 
identification of these high significance glitches in the input 
feature vector or the classifiers somehow do not take advan- 
tage of this information. Given the close agreement between 
various classifiers that we observe in the ROC curves (Fig- 
ure 4) and the histograms of glitch distributions (Figure 5), the 
former alternative seems to be more plausible. Alternatively, 
our choices of the thresholds and the coincidence windows 
that went into the construction of the feature vectors might 
not be optimal. Also, heretofore unincluded features charac- 
terizing the state of the detector, which may amplify transient 
disturbances in the auxiliary channels and induce glitches in 
the gravitational-wave channel, might be crucial for identi- 
fying glitches missed in the current analysis. We leave the 
investigation of these possibilities to future work. 

Although the ROC curves (Figure 4) and the histograms 
(Figure 5) provide strong evidence that all classifiers detect 
the same glitches, they do not give a clear quantitive picture of 
the overlap between these methods. To see this more clearly, 
we define subsets of glitches based on which combination of 
classifiers detected them with a probability of false alarm less 
than 10” 2 . We determine overlaps between the MLA classi- 
fiers by constructing a bit-word diagram (Figure 6). It clearly 
demonstrates a high degree of redundancy between the clas- 
sifiers. The fraction of glitches detected by all three MLA 
classifiers is 91.3% for S6 data and 78.4% for S4 data. For 
comparison, we show on the same figure the bit-word diagram 
representation for clean samples that are falsely identified as 
glitches with probability of false alarm less than ICG 2 . The 
classifiers’ predictions for clean samples are distributed al- 
most uniformly. This suggests that our classifiers select clean 
samples nearly independently, or at least with a much lower 
level of correlation than for glitches. 

Next, we compare the MLA classifiers to OVL. In order to 
reduce the number of possible pairings, we combine the MLA 
classifiers following the maximum-likelihood-ratio algorithm 
described in more detail in the next Section VII. In short, this 
algorithm picks the most statistically significant prediction out 
of the three MLA classifiers for each event. We denote the 
combined classifier as MLA max . As in the previous case, we 
construct the bit-word diagram for both glitch and clean sam- 
ples detected with the probability of false alarm less than 10 2 
(Figure 7). The redundancy is even stronger; the fraction of 
glitches detected by MLA max and OVL is 94.8% for S6 data 
and 85.2% for S4 data. The full bit-word histograms show the 
same behavior and we omit them here. 


VII. METHODS FOR COMBINING CLASSIFIERS 

On a fundamental level, the MLA classifiers search for a 
one-parameter family of decision surfaces in the feature space, 
x £ Vd, by optimizing a detection criterion. The parameter 
labeling the decision surfaces can be mapped into a continu- 
ous rank, tmla(^) £ [0, 1]. This rank reflects the odds for a 
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FIG. 5. Comparing distribution of glitches before and after apply- 
ing classifiers at 1 % probability of false alarm. This cumulative 
histogram shows the number of glitches that remain with at least as 
high a significance in the gravitational-wave channel. We see that 
all our classifiers remove similar fractions of glitches at 1% proba- 
bility of false alarm. This corresponds to their similar performances 
in Figure 4, with efficiencies near 30% and 55% for S4 and S6 data, 
respectively. We also see that the classifiers tend to truncate the high- 
significance tails of the non-Gaussian transient distributions, partic- 
ularly in S6. 


sample, x, to correspond to a glitch in the gravitational-wave 
channel. As we discuss in Section III and Appendix A, theo- 
retically, if the classifiers use consistent optimization criteria, 
they should arrive at the same optimal decision surfaces and 
make completely redundant predictions. In other words, their 
ranks would be functionally dependent. In practice, however, 
different classifiers often lead to different results, primarily 
due to the limitations in the number of samples in the train- 



(a) S4 bit-word histogram for MLAs 



(b) S6 bit-word histogram for MLAs 

FIG. 6. Redundancy between MLA classifiers. These histograms 
show the fractions of detected glitches identified in common by a 
given set of classifiers at 1% probability of false alarm (blue). The 
abscissa is labeled with bit-words, which are indicators of which 
classifier(s) found that subset of glitches (eg: Oil corresponds to 
glitches that were not found by ANN, but were found by RF and 
SVM). The quoted percentages represent the fractions of detected 
glitches so that 100% represents those glitches which were success- 
fully identified by at least one of the classifiers at 1% false-alarm 
probability. The three classifiers show large overlap for glitch identi- 
fication (bit- word = 111), meaning the classifiers are largely able to 
identify the same glitch events. Also shown is the fraction of clean 
samples (green) misidentified as glitches, which shows a compara- 
tively flat distribution across classifier combinations. 


ing sets and/or computing resources. For instance, different 
classifiers may be more or less sensitive to different types of 
glitches. In this case, one should be able to detect a larger set 
of glitches by combining their output. Furthermore, the clas- 
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(b) S6 bit-word histogram for MLA max and OVL 

FIG. 7. Redundancy between MLA max and OVL. This figure is 
similar to Figure 6, except these histograms only compare the re- 
sults of combining the MLA classifiers into a single unified classifier 
(MLA max ) and OVL. Even though OVL only considers pairwise 
correlations between auxiliary channels and the gravitational-wave 
channel, we see that it identifies the same glitches as MLA max . This 
suggests that the glitches identified by the MLA classifiers are effec- 
tively characterized by pairwise correlations between a single auxil- 
iary channel and the gravitational-wave channel, and that consider- 
ing multi-channel correlations does not add much. We also see that 
these classifiers are highly correlated on their selection of glitches 
(blue), but less correlated across the set of misidentified clean sam- 
ples (green). 


sifters may be strongly correlated in the ranks they assign to 
glitch samples, but only weakly correlated when classifying 
clean samples. Again, by combining the output of different 
classifiers, we may be able to extract information about these 
correlations and improve the total efficiency of our analysis. 


This last case appears to be applicable to our data set. From 
Section VI, we see that at a probability of false alarm of 
1%, all classifiers remove nearly identical sets of glitches (to 
within 10% for the S6 data). However, the classifiers agree 
to a significantly lesser extent on the clean samples they re- 
move (Figure 6). This suggests that the correlations between 
the classifiers’ predictions are different for glitches and clean 
samples, and combining the classifiers’ output could possibly 
lead to an improved analysis. 

The general problem of combining the results from multi- 
ple, partially redundant analysis methods has been addressed 
in the context of gravitational-wave searches in [20]. Treating 
the output of the classifiers, namely their ranks, as new data 
samples, one arrives at the optimal combined ranking given 
by the joint likelihood ratio: 


Ajoint (T) ^p(T j 0) (13) 

where r = (r A NN> J’svmi t rf) is the vector of the MLA ranks 
assigned to a sample, x, and p(r | 1) and p(r | 0) are the prob- 
ability density functions for the rank vector in the case of 
glitch and clean samples, respectively. We should point out 
that we can modify this ranking by multiplying by the ratio of 
prior probabilities (p(l)/p(0)) to match the rankings for in- 
dividual classifiers without affecting the ordering assigned to 
samples. Typically, these conditional probability distributions 
are not known and computing the joint likelihood ratio from 
first principles is not possible. One has to develop a suitable 
approximation. We try several different approximations when 
combining algorithms. 

Our first approximation, and perhaps the simplest, esti- 
mates the likelihood ratio for each classifier separately and 
assigns the maximum to the sample. This method should be 
valid in the two limits: extremely strong correlations and ex- 
tremely weak correlations between the classifiers. It was first 
suggested and applied in the context of combining results of 
multiple gravitational -wave searches in [20]. We estimate the 
individual likelihood ratios in two ways: 1) as the ratio of 
cumulative density function (cdf) and 2) as the ratio of kernel 
density estimates for the probability density function (pdf). 
Though a proper estimate should involve the pdfs, the advan- 
tage of using cdfs is that we already calculate them when eval- 
uating the efficiency and probability of false alarm for each 
classifier. They should approximate the ratio of pdfs reason- 
ably well in the tail of the distributions, when the probability 
of false alarm is low. This assumes that pdfs are either slowly 
varying or simple (e.g. power law or exponential) decaying 
functions of the rank. However, at large values of the prob- 
ability of false alarm or in the case when the probability dis- 
tributions exhibit complicated functional dependence on the 
rank, our approximation may break down and we will have to 
resort to the more fundamental ratio of the pdfs. Explicitly, 
we estimate the joint likelihood ratio using 


Li(r) = max 


fr)p(r' |l)dr'| p l(r .) 

> = max ^ ; J [ . 

f r . p( r 'j | 0) dr' J r s MM 


(14) 
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We refer to this method as MLA max when introducing it in the 
context of Figure 7. 

We also construct smooth one -dimensional pdfs for clean 
and glitch samples from their ranks using Gaussian kernel 
density estimation [41]. These estimates were built using a 
constant bandwidth equal to 0.05 in the rank space, which 
ranges from 0 to 1. Based on this, we define the approximate 
combined rankings: 


L 2 {r) 


max 

r 3 


p( r j 1 1 ) ] 

P( r j |0)J 


(15) 


It is by no means true that we can always approximate the 
multi-dimensional likelihood ratio (13) with the maximum 
over a set of one-dimensional likelihood ratios. If we can 
better model the multi-dimensional probability distributions, 
we should be able to extract more information. To this end, 
we also implement a slightly more complicated combining 
algorithm. We observe that the algorithms are highly corre- 
lated on which glitches they remove, and less correlated on 
the clean samples (see Figure 6). We therefore approximate 
p[f\l) « max rj {p{rj | 1)} and p(f | 0) « flj P( r j I 0), 
which assumes that the algorithms are completely uncorre- 
lated for the clean samples. Aj 0 ; nt is then approximated by 


= max rj . {p(rj | 1)} 

n,p(^io) 


(16) 


Again, we compute the individual pdfs using Gaussian kernel 
density estimation. 

More subtle, but still useful, correlations between the ranks 
assigned by different classifiers can not be accounted for by 
these simple analytical approximations. Estimating the multi- 
dimensional probability distributions is a difficult task and 
under-sampling quickly becomes the dominant source of error 
when expanding to higher than two dimensions. Rather than 
developing a complicated analytic model, we can use one of 
the MLA classifiers to compute the combined rank. We use 
RF to attempt to combine the ranks from each classifier and 
construct an estimate of the full (three-dimensional) joint like- 
lihood ratio. 

We compare the methods for combining the classifiers by 
computing ROC curves, which are shown in Figure 8. We 
reproduce only the S6 curves because the S4 data shows the 
same trends. 

All combined methods result in very similar ROC curves 
and, when compared to the OVL curve, they do not seem to 
improve the overall performance by more than a few percent. 
These combined results lead us to conclude that the individual 
classifiers have already reached nearly optimal performance 
for the given input data, and that their combination, while 
increasing their robustness, can not improve the overall effi- 
ciency. Basically, all the useful information has been extracted 
already. 

Although it is not immediately apparent, these combining 
schemes do add robustness to our identification of glitches. 
The combining algorithms are able to ignore underperform- 
ing classifiers and reject noisy input fairly well, and we see 



FIG. 8. Comparison of different combining algorithms using S6 data. 
This figure compares the performance of our various schemes for 
combining the output of the three MLA classifiers. We note that all 
four algorithms, Li (??), L 2 (??), L3 (??), and using RF to clas- 
sify events based on the MLA output vector r, agree to a remarkable 
degree. The fact that our simple analytic algorithms perform just as 
well as the RF suggests that there are not many subtle correlations 
between the classifiers’ output. The MLA combining algorithms do 
not perform much better than OVL. Comparing these curves with 
Figure 4 shows that the combined performance does not exceed the 
individual classifier’s performances. This suggest that the individ- 
ual MLA classifiers each extract almost all of the useful information 
from our feature vectors, and that they identify the same types of 
glitches. These conclusions are further supported by Figure 6. 


that they tend to select the best performance from the indi- 
vidual classifiers. By comparing Figure 8 with Figure 4, we 
see that the combining algorithms follow the best ROC curve 
from figure Figure 4, even when individual classifiers are not 
performing equitably. This is most evident at extremely low 
probabilities of false alarm. This robustness is important be- 
cause it can protect a combined glitch identification algorithm 
from bugs in a single classifier. In this way, the combining 
algorithm essentially acts as an automatic cross-reference be- 
tween individual MLA classifiers. 


VIII. CONCLUSION 

In this study, we apply various machine learning algo- 
rithms to the problem of identifying transient noise artifacts 
(glitches) in gravitational-wave data from LIGO detectors. 
Our main goal is to establish the feasibility of using MLAs for 
robust detection of instrumental and environmental glitches 
based on information from auxiliary detector channels. We 
consider several MLA classifiers: the Artificial Neural Net- 
works, the Support Vector Machine, and the Random Lorest. 
We formulate the general detection problem in the context of 
glitch identification using auxiliary channel information and 
show that, theoretically, all classifiers lead to the same opti- 
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mal solution. In a real-life application, our classifiers have to 
monitor a large number of channels. Even after data reduc- 
tion, the dimensionality of our feature space could be as high 
as 1250, making classification a truly challenging task. We 
test classifiers using data sets from the S4 and S6 LIGO scien- 
tific runs. We use standard ROC curves as the main figure of 
merit to evaluate and compare the classifiers’ performances. 

Our tests show that the classifiers can handle extraneous 
features efficiently without affecting their performance. Like- 
wise, we find that the classifiers are generally robust against 
changes in the size of the training set. The most important 
result of our investigation is the confirmation that the MLA 
classifiers can be used to monitor a large number of auxiliary 
channels, many of which might be irrelevant or redundant, 
without a loss of efficiency. These classifiers can be used to 
develop a real-time monitoring and detector characterization 
tool. 

After establishing the robustness of the classifiers against 
changes in the input data and the presence of nuisance pa- 
rameters, we evaluate the algorithms’ performance in terms 
of the ROC curve and carry out detailed comparison between 
the classifiers. This includes their impact on the overall dis- 
tribution of glitches in the gravitational-wave channel and the 
redundancy of their predictions. We find that at a false alarm 
probability of 1%, all classifiers demonstrate comparable per- 
formance and achieve 30% and 56% efficiency at identify- 
ing single-detector glitches above our nominal threshold when 
tested on the S4 and S6 data, respectively. While not superb, 
this is a step toward the ultimate goal of being able to remove 
all artifacts from the data. 

One advantage of the MLA classifiers is that they provide 
a continues rank, Tmla €E [0, 1], rather than a binary flag. 
This statistic can be incorporated directly into the searches for 
gravitational-waves as a parameter characterizing a candidate 
event along with the rest of the data from the gravitational- 
wave channel, as opposed to a standard approach of vetoing 
entire segments of flagged data based on a hard threshold on 
data quality. 

Another advantage of the MLA classifiers is that they can 
incorporate various potentially diverse types of information 
and establish correlations between multiple parameters. In ad- 
dition to the transient noise data used in this study, we plan to 
include more slowly-varying baseline information about the 
detector subsystems in the future. Lor example, the quality of 
alignment in the interferometer may be important for predict- 
ing the amount of noise that couples into the gravitational- 
wave channel from elsewhere in the instrument. Machine 
learning should be able to automatically identify such non- 
linear correlations, even if they are not known previously. 

As a final test, we explore several ways of combining the 
output of several classifiers in order to increase the robust- 
ness of their predictions and possibly improve combined ef- 
ficiency. Lollowing general principles for combing multiple 
analysis methods, we suggest several approximations for the 
optimal combined ranking given by the joint likelihood ratio. 
We test our approximations and find that they perform simi- 
larly to and do not improve upon the efficiencies of individual 
classifiers. 


Based on these results, we conclude that the three MLA 
classifiers used in this study are all able to achieve robust 
and competitive classification performance for our set of data. 
The RL classifier was the most robust against the form (range, 
shape, scaling, number) of input data, while ANN and SVM 
benefit from reshaping certain input parameters along phys- 
ical arguments. Since all classifiers achieve similar limiting 
performance and identify most of the same events, we con- 
clude that they are near-optimal in their use of the existing 
data. Luture improvement in classification efficiency is there- 
fore likely to come from including additional sources of useful 
information, rather than refinements to the algorithms them- 
selves. 


ACKNOWLEDGMENTS 

KK, YMK, CHL, JJO, SHO, and EJS were supported in 
part by the Global Science experimental Data hub Center 
(GSDC) at KISTI. KK, YMK, and CHL were supported in 
part by National Research Loundation Grant funded by the 
Korean Government (NRL-201 1-220-C00029). CHL was 
supported in part by the BAERI Nuclear R&D program 
(M20808740002). JC, EOL, and XW were supported in part 
by the Ministry of Science and Technology of China un- 
der the National 973 Basic Research Program (grants No. 
2011CB302505 and No. 201 1CB302805). TY was supported 
in part by the National High-Tech Research and Develop- 
ment Plan of China (grant No. 2010AA012302). RE, KH, 
EK, and RV were supported by LIGO laboratory. LIGO 
was constructed by the California Institute of Technology and 
Massachusetts Institute of Technology with funding from the 
National Science Loundation and operates under cooperative 
agreement PHY-0757058 


Appendix A: Derivation of the Decision Surfaces for some MLA 
Optimization Schemes 

In Section III, we stress that for the detection (or the two- 
class classification) problem, the most natural optimizing cri- 
terion is the Neyman-Pearson criterion, which requires max- 
imum probability of detection at fixed probability of false 
alarm. Optimizing this criterion, which in functional form is 
given by (2), leads to the one-parameter family of decision 
surfaces defined as surfaces of constant likelihood ratio (5). 
Each decision surface is labeled by a corresponding value of 
the likelihood-ratio, A(x), providing a natural ranking for un- 
classified events. The higher the likelihood-ratio, the more 
likely it is that event belongs to Class 1 . The likelihood-ratio 
can be mapped to the particular value of the false alarm prob- 
ability, Pq(A), which assigns it a statistical significance. In 
practice, it is often more convenient to define ranking in terms 
of some monotonic function of the likelihood-ratio, r(A) (eg: 
r(A) = In A). Classifying and ranking samples based on 
the likelihood-ratio is guaranteed to maximize the ROC curve 

Pi(Po). 
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In their standard configurations, most MLA classifiers ap- 
ply other kinds of optimization criteria (e.g. the fraction 
of correctly classified events or the Gini index). Many of 
these criteria treat the two classes of the events symmetri- 
cally, which often is more appropriate than the asymmetric 
Neyman-Pearson criterion. In this section, we would like to 
explore some of the most popular criteria used by the MLA 
classifiers in their relation to the Neyman-Pearson criterion. 
Specifically, we are interested in establishing consistency be- 
tween the various criteria used in this study, in that they lead to 
the same optimal decision surfaces and compatible rankings. 


given by 


C=f 0(f(x)-F*)p(x\l)p(l)dx 

Jv , d 


+ [ Q(F* - f(x))p(x \0)p(0)dx. 

Jv , d 


(A3) 


The requirement that the variation of C with respect to /( x) 
must vanish, 


1. The Fraction of Correctly Classified Events 


SC=[ 6(f(x)-F*)6f(x ) 

Jv A 

x [p( x | l)p(l) — p(x | 0)p(0)] dx = 0 . (A4) 


First, we consider probably the most popular criterion - the 
fraction of correctly classified events. This criterion is used by 
both ANN and SVM in our analysis. Following the approach 
of Section III, we define it as a functional of the decision func- 
tion, fix), on the feature space, Vj: 


C = / 0 (/(x) - F*)p( 1 | x)p{x) da; 

Jv A 

+ [ Q(F* - f(x))p(0\x)p(x)dx, (Al) 
Jv d 

where p( 1 | x) and p( 0 | x) are the probabilities for a sample 
with feature vector x to be classified as Class 1 and Class 0, 
respectively, and p(x) is the probability distribution of obtain- 
ing a feature vector, x, regardless of its class. To elucidate 
this expression, we note that p(c \ x)p(x) da; corresponds to 
the fraction of total events that fall in the hyper-volume dx 
and are of class c. Without the 0 functions, these integrals 
would evaluate to C = p(l) + p(0) = 1, and we see that the 
0 functions select only those events that are correctly classi- 
fied. With this interpretation, /(x) is the decision function. 
For a given threshold F*, it defines two regions of samples 
as Class 1 and Class 0 through 0 (/(x) — F *) and its com- 
plement 0 (F* — fix)), respectively. Thus, the first term in 
(Al) accounts for correctly classified events of Class 1 and the 
second term does the same for Class 0 events. 

Using Bayes theorem, one can express p{ 1 | x)p{x) and 
p(0 | x)p(x) in the first and second term of (Al) as: 


pi 1 I x)p(x) = pix I l)p(l) , (A2a) 

p(0 | x)p(x) = p{x | 0)p(0) . (A2b) 

Here pix | 1) and p(x | 0), defined in Section III, are the prob- 
ability density functions for feature vectors in the presence 
and absence of a glitch in gravitational-wave data, respec- 
tively. pi 1) and p(0) are the prior probabilities for having a 
glitch or a clean datum, related to each other viap(l) +p(0) = 
1 as always. The fraction of correctly classified events is then 


leads to the following condition for the points, x* , satisfying 

fix*) - F* = 0: 


pix* | l)p(l) - pix* | 0)p(0) = 0 . (A5) 

This equation defines the decision surface consisting of points 
for which 


a ( ^Ei 1 } = pO* I iX 1 ) = 1 

1 >pi 0) - Pix* I 0)p(0) 


(A6) 


Thus, optimizing the fraction of correctly classified events 
leads to a specific decision surface, for which the likelihood 
ratio, A(x*) = p(0)/p(l). By construction, the optimization 
criterion (Al) treats events of the both classes symmetrically, 
maximizing the number of correctly classified events. As a 
result, it selects a specific decision surface for which evidence 
for the sample to belong to either one class or the other are 
equal. 


2. The Gini Index 

Next, we consider the Gini index criterion, which is used in 
RF. For two-class problems, the Gini index of a region in the 
feature space, V, is defined as 


GiV) = l-p 2 ~q 2 =2pq (Al) 

where p and q are fraction of Class 1 and Class 0 samples in 
the region, with p+q = 1. The Gini index for multiple regions 
is given by the average 

G = G(Vi)p(Vi) (A8) 


where p( Vf) is the probability for a sample to be in the region 

Vi. 

For the two-class problem, there are two distinct regions - 
the region where samples are classified as Class 1, Vi, and the 
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region where samples are classified as Class 0, Vo- We make 
the following definitions: 


P = [ ©(_/>( 1 \x)p(x)dx 

Jv , d 

Q = [ ©(/)p(0 \x)p(x)dx 

Jv, d 


>v d 


0(/)p(a 


lv d 


Q(f)p(x 


I l)p(l)dx, 
(A9a) 

| 0)p(0) dx , 
(A9b) 


where P (Q) is the probability that a glitch (clean sample) 
will fall into V\ and 0(/) is shorthand for 0 (/(x) — F*). 
We recognize that the expected fractions of events in V± can 
be described as p = P/p(V l) and q = Q/p{V±), where 
p(Vi ) = f v 0(f)p( x) dx is the probability for any event (ei- 
ther glitch or clean sample) to fall in V\. Furthermore, we can 
immediately write the corresponding relations for Vo in terms 
of P, Q, and p(V i) = 1 — p(Vo). We then obtain: 


G PQ (p(l) - P) (p(0) - Q) 
2 p(Vi) 1 -p(Vi) 


(A10) 


where p( 1) and p( 0) are the prior probabilities for an event to 
be Class 1 and Class 0, respectively. We also note that p( Vi ), 
P, and Q are functionally related through 


p(Vi)= [ &{f)p(x) dx 
Jv A 


= / ©(/) [p{ 1 I x)p(x) +p( 0 | x)p(x)] dx 

Jv , d 


= P + Q. 


(Alla) 


We now optimize the Gini index, (??), to determine the op- 
timal decision surface. For simplicity, let us first optimize G 
while holding p{V\ ) constant. We are interested in the optimal 
decision surface’s shape, and this optimization will determine 
it while keeping the ratios of the number of samples in V\ and 
Vq constant. We obtain 


1 SG 

2 <5/(x) 


p(x*)Q + q(x*)P p(x*)Q + q(x*)P 


P (Vi) 

p(l)q(x*) + p(0)p(x*) 

l-p(Vi) 


1 -p(vi) 


(A12) 


where 

P( x *) = TJT^: = Sx ,x* [ Hf( x )^ f *)p( x l 1 )p( 1 )dx, 
Sf(x*) J Vd 

(A 13a) 

g ( x *) = r F *)p{ X |0)p(0)dx, 

°}\ x ) Jv A 

(A 13b) 

where 5 XiX * = Sf(x)/Sf(x*) = {1 if x* = x,0 otherwise}. 
We see thatp(x*) and q(x*) are probability density functions 
defined on the decision surface (x* € {/(x*) = F*}). We 
can write these variations in terms of conditional probabilities 


on the decision surface. 


p{x*) = p{x* | l)p(l) , (A 14a) 

q(x*) = p(x* | 0)p(0) . (A14b) 


Requiring the variation of the Gini index, (A 12), to vanish 
leads to the following relation 


P{x*) 

q{x*) 


p(l)p(Vl) - P 

p{0)p(Vi) -Q' 


(A15) 


Identifying the left hand side of this equation with the 
likelihood ratio for a point on the decision surface, 

A(x*)(p(l)/p(0)) = p{x* | l)p{l)/p{x* | 0)p(0), and using 
(A1 1) we find that 


Afx*) ^ 1 ) = 1 

H J p(0) ’ 


(Alb) 


which will hold for all points on the decision surface. Re- 
markably, this condition is independent of P, Q and p(V i). 
As in the case of the fraction of correctly classified events 
(A6), it implies that the optimal decision surface is the surface 
on which the likelihood ratio is equal to a ratio of the priors. 

If we consider the more general maximization problem, in 
which we allow p(V\) to vary, we must maximize 


PQ , (p( 1) - P) (p(0) - Q) 


o - i /t/ \ +A (p(Vi) — P — Q) . 

2 p(V l) l-p(Vi) 

(A17) 

where we use a Lagrange multiplier (A) to enforce the condi- 
tion p(Vi) — P — Q = 0 and consider variations of p(Vi)to be 
independent of variations in /(x). 

The variation with respect to p{V\ ) defines the Lagrange 
multiplier. 


A _. PQ (1 ~~ 2p(Vi)) — p(Vi) 2 [p(i)p(o) — p{P)Q — p(o)p] 

p(^l) 2 (l-p (^)) 2 

(A18) 

Variation with respect of /(x) is given by (A12) minus 
A (p(x*) + q(x*)). Setting it to zero for all independent varia- 
tions of f(x*) leads to a more general condition on the likeli- 
hood ratio: 


A(x*) = 


Ap(Vt) (1 —p(Vi)) + p(l)p(Vi) ~ P 
Ap(Vi) (1 -p(Vi)) +p(0)p(Vi) - Q ’ 


which holds separately for all points x* on the decision sur- 
face. 

First of all, note that this condition still requires a constant 
likelihood-ratio on the decision surface. For each point on the 
surface, A(x*) is determined by P and Q, which are constants 
for a given surface. We recover (A15) and (A16) when A = 0. 
In all other cases, the likelihood ratio on the decision surface 
is given by a quite complicated expression, obtained by plug- 
ging (A18) and (All) into (A19). In practice, the likelihood 
ratio on the decision surface is set by the desired value for the 
probability of false alarm, Po, but the ratio will be constant 
over the entire surface. Optimization of the Gini index, then, 
will be equivalent to optimizing the ROC curve in the region 
of interest, e.g. in our study near Pq = 10” 2 . 
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3. Asymmetric Criteria 

Both criteria considered so far are symmetric in their treat- 
ment of events in Class 1 and Class 0. While the asymme- 
try can be imposed by tuning the ratio of prior probabilities 
(p(l)/p(0)), in some cases it might be desirable to use an ex- 
plicitly asymmetric criteria. The RF implementation in the 
StatPatternRecognition package contains two different asym- 
metric criteria, which we explore in our study: Signal Purity 
(P) and Signal Significance ( S ) [40]. By construction, they 
identify one class of events as signal and the other as back- 
ground and place more emphasis on correctly classifying sig- 
nal rather than background. 

The Signal Purity and the Signal Significance are defined 
as 


p = 

u 1 

(A20) 

Wi + Wo ’ 



s = 

W 1 

(A21) 

. A. ,2 1 ,.,2 ' 


where w i and w o are the fraction of Class 1 (signal) and Class 
0 (background) events in the signal region. The aim is to iden- 
tify the signal region with the highest Signal Purity or Signal 
Significance. In the process of the decision tree construction, 
the classifier identifies the terminal nodes with the highest val- 
ues of P or S and orders nodes using these criteria as a rank. 


For a given terminal node of a tree, one can express wi and wo 
in terms of conditional probabilities: 

w i = p(x 1 1) , (A22) 

Wo = p(x I 0) . (A23) 


In terms of these, one can rewrite the signal purity and the 
signal significance as 


p{x 1 1 ) = A(x) 

p{x 1 1 ) + p(x | 0) A(x) + 1 ’ 

p(x 1 1 ) = AQ) 

\Jp{x | l) 2 +p(x | 0) 2 y/ A 2 (x) + 1 


(A24) 

(A25) 


Both quantities are monotonic functions of the likelihood ratio 
A(x) = p{x 1 1 )/p(x | 0). Ordering nodes in the tree would 
be equivalent to ranking events by the likelihood ratio, which 
in turn is equivalent to using decision surfaces of the constant 
likelihood ratio for classification. 

In practice, different classifiers have various limitations 
which results in suboptimal performance. Depending on the 
application, one algorithm or criteria may be more optimal 
than another, but we establish here that on a theoretical level 
they all recover the same optimal solution. In the two-class 
classification problem, the decision surfaces are surfaces of 
constant likelihood ratio, which also defines the optimal rank- 
ing for samples. 
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